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Abstract 



The notion of well-separated sets is crucial in fast multipole methods as the main idea is to approximate the 
interaction between such sets via cluster expansions. We revisit the one-parameter multipole acceptance criterion 
in a general setting and derive a relative error estimate. This analysis benefits asymmetric versions of the method, 
where the division of the multipole boxes is more liberal than in conventional codes. Such variants offer a particularly 
elegant implementation with a balanced multipole tree, a feature which might be very favorable on modern computer 
architectures. 
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1. Motivation 

We consider in this paper a general error analysis and some implementation issues for fast multipole methods 
(FMMs). Since their first appearance in fslQ], these tree-based algorithms have become important computational 
tools for evaluating pairwise interactions of the type 



FMMs offer an O (Af)-complexity and an a priori error estimate, but implementing a fully fledged adaptive FMM in 
3D is a daunting task I?]. Parallelization issues complicate matter even further and call for a balance between, on 
the one hand theoretical efficiency, and on the other hand software complexity 1 14, 16]. A major inconvenience with 
adaptive versions is the complicated memory access pattern which is due to the communication between levels in 
the multipole tree. Although this can be mitigated either through post-balancing algorithms or by employing 
more advanced data-structures [9j Chap. 6.6], with modern data-parallel architectures it has in fact been suggested 
that uniform versions off'er better performance [10]. 

An alternative is to use asymmetric adaptive meshes as outlined in Figures 11.11 and 14.11 Here the tree becomes 
balanced at the cost of a variable, but local, communication stencil. Also, the actual form of the multipole acceptance 
criterion becomes more critical as the mesh looses regularity. Comparisons between different criteria for cell-to- 
particle methods are found in fl3], while a careful worst-case analysis of uniform FMMs is found in (IjI]. 

The main contributions of the current paper are found in Sections |2] and [3] where the precise statement of the 
multipole acceptance criterion and the required assumptions on the kernel G are discussed together with a general 
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error analysis. Our approach is inspired by the treatment in the monograph 181], but we are able to offer several 
important corrections. In Sections |4] and |5] implementation issues are highlighted and we also perform numerical 
experiments illustrating the sharpness of the theory and the efficiency of the proposed approach. 



Figure 1.1: Sample asymmetric adaptive multipole 
mesh in 2D. The discretization is obtained by suc- 
cessively splitting boxes along each coordinate axis 
in such a way that the number of sources in the four 
resulting boxes is very nearly equal. In this exam- 
ple the grey boxes are to interact through cluster- 
to-cluster interactions with the black box. That is, 
they satisfy the 9-criterion ((9=1 /2). 




2. Well-separated sets and kernel assumptions 

We begin by stating the one-parameter criterion for two sets being well-separated. This is also a correction to the 
too weak criterion found in ||8t Eq. (8.31), p. 379]. 

Criterion 2.1 (0-criterion). Let the sets S 1,82 c be contained inside two disjoint spheres such that \\S 1 - xqW < r\ 

and \\S2 — yoll ^ '"2- Given e (0, 1), \f d = \\xq - yoll, R = max{ri,r2), and r = min{ri,r2), then the two sets are 
well-separated whenever R + Or < 0d. 

In other words, any of the two sets may be expanded by a factor of 1/0 and arbitrarily rotated about its center point 
without touching the other set. When regarded as a parameter inside the FMM it will be evident that a smaller yields 
a smaller error at the cost of a larger communication stencil. 

We remark that the ^-criterion under consideration is symmetric in its two arguments as a reflection upon the 
fact that we mainly consider FMMs where the sources and the evaluation points are "roughly" the same. — Indeed, 
our implementation as outlined in Section |4] produces a representation of the total field in the whole enclosing box 
under consideration. An algorithm which is adaptive in sources and evaluation points separately is described in ^ 
Chap. 6.6; see also Fig. 5.1]. 

We shall need the following two simple consequences of the ^-criterion: since < 1 we get 



d + r 



R d + r + R/0 
< — < 



2d 



d-R d-e(d-r) (\ - ff)d + Or \-6 
Furthermore, writing the 0-criterion as r < 9{d - R) + {9 - \){R - r), we also see that 

r R 



d-R d — r 



(2.1) 



(2.2) 



For a and jS multi-indices in the normed space (Zf , | ■ |) we write for brevity Gafi{x,y) - d" xd^yG(x,y) and 
additionally define factorials and powers in the usual intuitive manner. Similarly to Eq. (8.15), p. 317] (but here 
with an explicit factor n\) we assume that the kernel G's first few derivatives are equivalent to the harmonic potential: 



Assumption 2.2 (Kernel regularity). For ff,/? e with \a\, \ 

|Ga,/j(x,y)| < C- 



P + l, 



\x-y\\" 



(2.3) 
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where n = \a + 13\. Additionally, G is positive and satisfies 

\\x-y\\-' <cG{x,y). (2.4) 

The analysis below takes place in D-dimensional Cartesian space. In order to avoid the worst-case bound || ■ ||i < 
Vd|| ■ II, assuming some kind of symmetry of the kernel seems inevitable (see also 114J): 

Assumption 2.3 (Rotational invariance). For any rotation T of the coordinate system, 

\G{x,y)\^\G{Tx,Ty)\. (2.5) 

In other words, we may freely rotate the coordinate system before expanding the kernel (provided of course that 
any expansion points are rotated as well). By choosing a T such that ||rx||i = ||x||, results obtained below in || ■ ||i will 
be transferred to the Euclidean norm without introducing any constants. 



3. Analysis 



Consider now two points x e S \ and y e S2, where S \ and 5 2 satisfy the 0-criterion. Approximating a unit source 
at y using a far-to-far translation (G — > Gp, centered at yo), followed by a far-to-local expansion {Gp — » Gp, centered 
at xq), can be written as 



G{x,y) = Gp{x,y; jccyo) - \Gp{x,y; xo,yo) - Gp{x,y; yo)] - \Gp{x,y; yo) - G(jc,3')] 

-'■ Gp{x,y; XQ,yo) - -Efai-.to-local - -Efar-to-far. 

with p the order of the expansion. We consider the two errors in turn. 

The integral form for the remainder of the D-dimensional Taylor series becomes 



(3.1) 
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using ( 12.31 ) and the multinomial theorem. As discussed in conjunction with Assumption l2.3l above we may now replace 
\\y - yolli with \\y - yoll- Using the triangle inequality and (12.2b yields \\y - yoll/lk - Joll < 6* so that from ( 12.41) . 



1 ^ ^, Jx-y\\ ^ , \\y-yo\\ 

< cG(x,y)- < cG(x,y)ll + 

• \ \\x-yo\\ 



\\x-yo\\ \\x-yo\\ 
Using l-f<l-0f<l-f||3'- yoll/lk - yoW we readily bound the integrand and get 

1 +1 



< cG(x,y)(l + 0). 



|£far-to-farl < cC(p + 1) 0"^' 



1 



7Gix,y). 



(3.2) 



(3.3) 



As for the first term in ( 13. Il l we obtain this time a sum of integral remainders, 

(x-xor(y-yof 



l-Efar-to-locall = (/'+!) 
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where the multinomial theorem and the rotational invariance were used twice. The sum can be evaluated when the 
upper limit tends to oo; hence by uniform convergence. 



|£far-to-locall<C(p+l)||^-Xoir' f ; —^dt 

Jo (||xo + t(x - xq) - yoW - \\y - yaWf^'' 
+ ^ — ^ I ^ — : — -pi'^^- 



r (1 - ty 

'iWxQ - Mil - \\y - yolir' Jo (i _ t„ „\ 

Using ( 12.41 ). the triangle inequality, and ( 12.11 ) we get (compare with ( 13.21 )) 



1 ^ ^ ll-^-yll , 2 

< cG(x,y)- — < cG{x,y)- . (3.4) 



Iko -yoll - lly-yoll IN-yoll-lly-yoll l- 

For the integrand we use the same type of bound as before and finally get 

|£far-to-locall < cC(P + 1) 0'^' _i_G(x,y). (3.5) 

(1 - ey 

By summing the contributions ( 13.3b and ( 13.51 ) from A' positive potentials (cf. ( 12.41 )) we thus conclude that the 
relative error for the pth order fast multipole method under the 0-criterion is bounded by a constant x 9''^ ' /( 1 - S)^- 

In the above derivation we are lead to the 0-criterion by the initial requirement that llj - yoll/ll-'c - yoll ^ &■ Clearly, 
the only natural way to obtain this is from the the triangle inequality and the slightly stronger requirement that \\y - 
3'oll/(ll'«^o - yoll - 11-^ - -^^oll) ^ 9 (cf. ( 12.21) ). Despite this clear reasoning we have not been able to find our version of the 
0-criterion made explicit in the literature. Presumably, this is due to the fact that quadratic meshes are so popular 

We note also that the integral remainder term was consistently used instead of the Lagrangian version (hinted at 
also in ||T2, 17]). To see why, note that in bounding e.g. fifar-to-far above we would otherwise obtain terms of the form 
Gofi(x,yo + ^(y - yo)) with f e [0, 1] and |;S| = p + I. The only sensible bound now includes the factor (1 - 0)"*''+^* 



suggesting that the convergence would deteriorate as — > 1/2. By contrast, the bounds ( 13.31 ) and ( 13.51 ) are perfectly 
regular for any Q < < I. 

It is to be stressed that we like to view the above analysis as a kind of template; the precise form of the assumptions 
and their implications are all very clearly visible. The effect of changing them can therefore readily be assessed. 



4. Implementation 

The current paper on the 0-criterion stems mainly from the fact that allowing an asymmetric adaptive splitting 
when creating the FMM mesh makes a convenient implementation possible. The ease of implementation is mainly 
thanks to the fact that the associated multipole tree is always balanced so that a static memory layout becomes natural. 
As a result, neighboring boxes are always arranged in the same level in the multipole tree, thereby facilitating the 
communication between them (see Figure|4T|for an illustration). 

We now give a brief description of our current implementation. For more detailed information the interested reader 
is kindly referred to the freely available code itself (see Section lSTt . 

The multipole mesh is constructed first and is obtained by recursively splitting the source points at or near their 
median in each coordinate direction. A very fast and well-known algorithm is available for doing just this; this is 
the median- of -three selection algorithm which is most often used when implementing quicksort 1 13, Chap. 9]. The 
source points themselves are naturally stored in a pyramid data-structure, a 2"^ -tree cut at a certain maximum level |0, 
Chap. 5.3]. 

After the sources have been assigned a box at the finest level, the connectivity information is determined. At any 
level in the tree the boxes are either decoupled or, respectively, strongly/weakly coupled. For each box b, the strong 
connections S (p) of its parent box p are examined; all children of a box in S (p) that satisfy the ^-criterion with respect 
to b become weakly coupled to b — the rest remain strongly coupled. This simple rule together with the fact that a box 
is always strongly connected to itself recursively defines the connectivity for the whole tree. The resulting topology 
is conveniently stored in sparse matrices with sparsity patterns known before each level is to be examined. 
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In the downward/upward phases the expansions are shifted as usual according to parent/child relations. The critical 
far-to-local shift follows the weak connectivity pattern and, at the finest level, the remaining strong connections are 
evaluated directly. As suggested in [jSi,^ all shifts in the implementation tested below relies on BLAS Level 3 routines 
with constant transition matrices (using pre- and post-scaling according to the local geometry). 

A few structural optimizations are optionally possible. For instance, a box weakly coupled to all children of a 
parent box can often interact via the latter instead (whenever the ^-criterion with respect to that parent is true). This 
is visible in Figure [TTT] where some of the weakly coupled grey boxes are noticeable large — here the interactions are 
in fact managed via parent boxes. A related idea at the finest level is to investigate strongly coupled boxes of very 
different radii, r <sc 7?, say. If the 0-criterion is true when the roles of r and R are exchanged, then the sources in the 
larger box can be directly converted into a near field expansion in the smaller box, and the outgoing expansion from 
the smaller can simply be evaluated at each point in the larger box. This optimization was suggested already in |3]. 

Since it is beneficial to admit as large set of boxes as possible into the set of weak connections, it is natural to try to 
somehow locally relax the 0-criterion. Rather than uniformly enforcing a single value of 6 for example, different 0,'s 
may well be accepted provided that an estimate of the total error remains bounded. For instance, if n observed values 
of 0i satisfy 2,- (f^^^ In < 0''^', then the relative error estimate is still O{0''^^^. Ideas along these lines are discussed in 




Figure 4.1: Standai'd midpoint adaptivity (left) vs. asymmetric adaptivity (right) for ttie same set of point sources (not siiown). In botli cases the 
boxes that are strongly connected to the black box are shaded grey (6 = 1/2). In the conventional approach, strongly connected boxes of different 
sizes imply cross-level communication in the multipole tree. By splitting the boxes at the median point source rather than at their geometrical 
midpoints, those boxes always belong to the same level. 



We conclude this section by very briefly discussing the algorithmic complexity. It is known that there typically ex- 
ists extremely non-uniform distributions of source points that make most tree-codes (even the adaptive ones) run in 
time proportional to O {n^^ I !]■ Practical experience is usually much better, although for certain applications, simpler 
algorithms are occasionally preferred (see [2], and further the discussions in fSl Chap. 8.7], and [|9t Chap. 6.6.3]). 

In any case, the serial complexity for a 2D implementation using asymmetric adaptivity can be estimated to be 
roughly proportional to ■ N, since each of about boxes at the finest level is to interact through cluster-to-cluster 
interactions with on the order of boxes at the same level, and since each such shift requires on the order of 
operations. For a given relative tolerance TOL, the analysis in Section [3] implies p ~ logTOL/logfl, so that the 
complexity is (9 (^0^^ log"^ ■ log^ TOL), thus indicating that the choice 6 = exp(-l) x 0.368 is nearly optimal. In 
practice, the best value of 6 as well as the optimal number of subdivisions is rather dependent on the hardware and 
should be determined from experience. On balance we have found that the convenient choice 6 - 0.5 and subdividing 
the points until the number of sources per box is ~ 20 works very well in practice. 
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5. Experiments 

As illustrations to the analysis in Section [3] and in order to highlight some of the benefits with the proposed 
adaptivity we report some results from our two-dimensional implementation which employs the classical complex 
polynomial/multipole representation QHl- 

Firstly, we investigated the sharpness of the ^-criterion and the accompanying error analysis. For this purpose, 
the complex- valued field G(zi,Zj) = -mj/(zi - zj) was used and we calculated the total force in a system consisting 
of a million sources (see Figure |5T| for results and further details of this simulation). Since G is complex-valued the 
assumption ( 12.4b is clearly violated. Cancellation effects therefore implies that relative error estimates may well be 
impossible to obtain. Nevertheless, since the points are distributed randomly and an irregular mesh is used the effect 
is negligible in this case (although somewhat more prominent for the uniform distribution, see Figure ISTt . 



Figure 5.1: Relative error as a function of p for 
e e {0.25, 0.5, VaS) and for two different distri- 
butions. A million point sources of total mass 1 
are distributed inside the unit circle f i) uniformly, 
and ( ii) non-uniformly with radial density oc 1 / r 
(see Figure[TTT]for the resulting mesh in the latter 
case). For convenience, the error is measured in 




a random sample (M = 1000) of points. Order p 

Secondly, we evaluated the efficiency of the proposed adaptivity at an effective relative tolerance of about 10"^ 
(using p -20 expansion coefficients). Since our code is developed from a highly optimized uniform code outlined in 
||6|], a reasonably fair comparison is possible. Evidently, a square and uniformly subdivided multipole mesh implies a 
completely regular access pattern so that a highly efficient implementation using direct addressing techniques is pos- 
sible. It is therefore of some interest to estimate under what circumstances adaptivity is actually beneficial. For this 
purpose we measured the speedup achieved by the adaptive code when the point sources were sampled from increas- 
ingly non-uniform data. The results are displayed in Figure Isl and shows that only for very uniform distributions of 
points is there a small gain (< 15%) in using the uniform code. 

As a third and final experiment we assessed the performance of the adaptive code for different point distributions. 
Using again a relative tolerance of 10"^, we measured the CPU-time for increasing number of points sampled from 
three very different distributions. As shown in Figure 15731 the code is very robust indeed and scales well at least up to 
some 5 million point sources on a single CPU. 

5.1. Reproducibility 

Our two-dimensional implementation as described in Section]?] is available for download via the author's web- 
pag^. The code has been tested on several platforms, is fully documented, and comes with a convenient Matlab 
mex-interface. Along with the code, automatic Matlab-scripts that repeat the numerical experiments in Section|5]are 
also distributed. The results presented here were all obtained with a 3.06 GHz Intel Core 2 Duo processor with 4GB 
of internal memory running under Mac OS X 10.6.6. 



^,http : //user ■ it .uu. se/- stef ane/f reewaxe 
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Figure 5.2: The benefits of adaptivity: speedup of the 
adaptive FMM versus an optimized uniform FMM. 
The hai'monic potential is evaluated at 200,000 ran- 
dom points, chosen either from a normal distribution 
with variance <t- or from a combined 'layer' distribu- 
tion where the jc-coordinate is uniformly distributed 
in [0, 1], and the y-coordinate again is 
distiibuted. For ease of comparison, both distribu- 
tions are forced by rejection to fit exactly within the 
unit square. As o" — > 0, stronger clustering around 
the origin and the .v-axis occurs, respectively, and the 
benefits with adaptivity quickly show. 




6. Conclusions 

Asymmetric adaptive meshes offer convenient and efficient FMM implementations. If cluster-to-cluster interac- 
tions are restricted to boxes for which the 0-criterion is true and if the kernel satisfies the assumptions outlined in 
Section |2] then the relative error can be bounded by a constant x ~ S)^ for 6 e (0, 1) and with p the num- 

ber of expansion terms. The computational complexity of the resulting 2D-algorithm can be expected to be about 
O[0-^ log--6l-A?log2TOL) for some target tolerance TOL. Not only is the actual performance of the algorithm 
competitive with optimized uniform FMMs even for relatively uniform data, but it is also robust for non-uniform 
distribution of points. Ongoing work includes porting the code to manycore platforms. 
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